Abstract. Exploiting the full computational power of ever deeper
hierarchical multiprocessor machines requires a very careful distribution of
threads and data among the underlying non-uniform architecture. The 
emergence of multi-core chips and NUMA machines makes it impor- 
tant to minimize the number of remote memory accesses, to favor cache 
affinities, and to guarantee fast completion of synchronization steps. By 
using the BubbleSched platform as a threading backend for the GOMP 
OpenMP compiler, we are able to easily transpose affinities of thread 
teams into scheduling hints using abstractions called bubbles. We then 
propose a scheduling strategy suited to nested OpenMP parallelism. The 
resulting preliminary performance evaluations show a significant
improvement in speedup on a typical NAS OpenMP benchmark application.

Keywords: OpenMP, Nested Parallelism, Hierarchical Thread Schedul- 
ing, Bubbles, Multi-Core, NUMA, SMP. 



1 Introduction 

The emergence of deeply hierarchical architectures based on multi-threaded 
multi-core chips and NUMA machines raises the need for a careful distribution of 
threads and data. Indeed, cache misses and NUMA penalties become more and 
more important with the complexity of the machine, making these constraints as 
important as parallelization. They require some new programming models and 
new tools to make the most out of these underlying architectures. 



As noted by Gao et al. [GSS+06], it is important to expose domain-specific
knowledge semantics to the various software components in order to organize 
computation according to the application and architecture. Indeed, the whole 
software stack, from the application to the scheduler, should be involved in the 
parallelizing, scheduling and locality adaptation decisions by providing useful 
information to the other components. 

Therefore, in OpenMP frameworks, the information extracted by the compiler
(about memory affinity and membership of the same parallel section) can be
very useful for guiding task/thread scheduling. On the other hand, it is
very important to rely on architecture-specific constraints when making these
scheduling decisions. A tight interaction between the OpenMP stack and the
underlying hardware-aware scheduler is thus required.

The most delicate point, when dealing with irregular applications, is to ex- 
ploit this knowledge at runtime (during the whole execution time) so as to main- 
tain a good balancing of threads when events arise (task termination, creation 
of new embedded parallel sections, blocking synchronization, etc.). 

In this paper, we propose a hierarchical threading library able to follow
scheduling directives and hints in a very powerful manner. Scheduling
information (affinity, group membership) is attached to bubbles, which are
abstractions that can recursively group threads or bubbles sharing common
properties.

We report on preliminary experiments on an 8-way multi-core NUMA
machine and we show that running OpenMP applications on top of our runtime
system greatly enhances performance on hierarchical architectures under
irregular conditions. We also propose insights regarding the extraction of useful 
information by the compiler for our runtime and discuss the addition of a couple 
of non-standard OpenMP directives that would improve performance. 

2 Scheduling Applications Featuring Nested, Irregular 
Parallelism 

Achieving the best possible performance when programming OpenMP applications
requires developers to expose the parallelism and to explicitly design their
code to drive its parallel behavior. Therefore, it is quite common nowadays to
define per-thread specific data structures (in order to avoid false sharing) and
use a static, possibly pre-calculated, distribution of the workload to get good
data locality [MM06]. Indeed, this model suits regular applications with
coarse-grain parallelism very well.

However, this approach is hardly usable when dealing with irregular applica- 
tions that rather need a dynamic load balancing mechanism. The use of complex 
synchronization schemes, or even blocking system calls, may also be responsible
for introducing irregularities regarding the computing load on the available pro- 
cessors. Using OpenMP dynamic scheduling directives can sometimes improve 
performance. In some cases, however, it may penalize data locality or even intro- 
duce false sharing effects, which can severely impact performance on hierarchical 
architectures. 

Another approach is to increase the number of potential parallel tasks using
nested parallelism, so that threads can be dynamically (re)allocated according to
the workload disparity. The performance of such dynamic thread management,
when supported^1, heavily relies on the underlying runtime implementation, but
also on the underlying operating system's scheduler. This explains why OpenMP
users have been experiencing poor performance with the nested capabilities of
some OpenMP compilers, and have ended up performing explicit thread
programming on top of OpenMP [BS05, GOM+00] or explicitly binding thread
groups to processors [Zha06].

^1 Nested parallelism is currently an optional feature in OpenMP.

Nevertheless, there exist some very good implementations of OpenMP nested
parallelism, such as Omni/ST [TTSY00] for instance. Such implementations are
typically based on a fine-grain thread management system that uses a fixed
number of threads to execute an arbitrary number of filaments, as done in the
Cilk multithreaded system [FLR98]. The performance obtained over symmetrical
multiprocessors is often very good, mostly because many tasks can be executed 
sequentially with almost no overhead when all processors are busy. However, 
since these systems provide no support for attaching high level information such 
as memory affinity to the generated tasks, many applications will actually achieve 
poor performance on hierarchical, NUMA multiprocessors. 

One could probably enhance these OpenMP implementations to use affinity 
information extracted by the compiler so as to better distribute tasks or threads 
over the underlying processors. However, since only the underlying thread sched- 
uler has complete control over scheduling events such as processor idleness, block- 
ing syscall or even thread preemption, this information could only be used to 
influence task allocation at the beginning of each parallel section. 

We believe that a better solution would be to transmit information extracted 
by the compiler to the underlying thread scheduler in a persistent manner, and 
that only a tight integration of application-provided meta-data and architecture 
description can let the underlying scheduler make appropriate decisions during
the whole application run time. In other words, one can see this configurable
scheduler framework as a domain-specific language enabling scientists to transfer
their knowledge to the runtime system [GSS+06].

3 MaGOMP: an Implementation of GNU OpenMP for 
Hierarchical Machines 

To evaluate the potential gain of providing a thread scheduler with persistent 
information extracted by an OpenMP compiler, we have extended the GNU 
OpenMP runtime system (i.e. the libgomp library) so as to rely on the Marcel 
thread library. This library provides facilities for attaching various information to 
groups of threads, together with a framework that helps to develop schedulers 
capable of using these metadata. Scheduling policies are simply developed as 
plug-ins. 

Before describing our extensions to the GNU OpenMP compiler suite, we 
first present the most important features of the Marcel library. 

3.1 The Bubble Scheduling Model 

Marcel is a POSIX-compliant thread library featuring extensions for easily writ- 
ing efficient, customized schedulers for hierarchical architectures. The API of 
Marcel provides functions to group threads using nested sets called bubbles [Thi05].



These abstractions allow programmers to model the relationships between the
different threads of an application. Figure 1 illustrates this concept: four threads
are grouped as pairs in bubbles (assuming they work on the same data), which
are themselves grouped along with another thread in a larger bubble (assuming
they share information less often). Bubbles allow expression of relationships like
data sharing, collective operations, or more generally a particular scheduling
policy need (serialization, gang scheduling, etc.).

Hierarchical machines are modelled with a hierarchy of runqueues. Each
component of each hierarchical level of the machine is represented by one
runqueue: one per logical processor, one per core, one per chip, one per NUMA
node, and one for the whole machine. Marcel's ground scheduler then uses a
hierarchical Self-Scheduling algorithm. Whenever idle, a processor scans all
runqueues that span it, from bottom to top, and executes the first thread that is
found. For instance, if the thread is on a runqueue that represents a chip, it may
be run by any processor of this chip (see Figure 2).




Fig. 1. Expressing thread relationships: graphical and tree-based representa- 
tions. 

As mentioned previously, Marcel provides a high-level API for writing powerful
and portable schedulers that manipulate threads, bubbles and runqueues.
Threads and bubbles are equally considered as entities, while bubbles and
runqueues are equally considered as scheduling holders, so that we end up with
entities (threads or bubbles) that we can schedule on holders (bubbles or
runqueues). Primitives are then provided for manipulating entities in holders.
Runqueues can be accessed through vectors, and can be traversed using
"parent" and "child" pointers. Some functions gather statistics about bubbles
so that appropriate decisions can be made. These include, for instance, the total
number of threads and the number of running threads, but also various
information such as the accumulated expected and current CPU computation
time, memory usage, or cache miss rates.

Fig. 2. Scheduling of bubbles and threads on the runqueues of a hierarchical
machine.

Writing a high-level scheduler actually reduces to writing some hook functions.
The main one is called when the ground Self-Scheduler encounters a bubble
during its search for the next thread to execute. The default implementation
just looks for a thread in the bubble (or one of its sub-bubbles) and
switches to it. The bubble_tick() hook is called when a time slice for a
bubble expires, and hence permits periodic operations on bubbles with a
per-bubble notion of time. Of course, mere "daemon" threads can also be started
for performing background operations. As a result, scheduling experts may
manipulate threads with a high level of abstraction by deciding the placement of
bubbles on runqueues, or even temporarily putting some bubbles aside (by
defining their own runqueues that the basic Self-Scheduler will not look at).

3.2 Generating Bubbles Out of OpenMP Parallel Sections 

The GNU OpenMP compiler [gom], GOMP, is based on an extension of the
GCC 4.2 compiler that converts OpenMP pragmas into threading calls. The 
creation of threads and teams is actually delegated to a shared library, libgomp, 
which contains an abstraction layer to map OpenMP threads onto various thread 
implementations. This way, any application previously compiled by GOMP may 
be relinked against an implementation of libgomp on another thread type and 
transparently work the same. 

We used this flexible design to develop MaGOMP, a port of GOMP on top 
of the Marcel threading library in which BubbleSched is implemented. To do so, 
a Marcel adaptation of libgomp threads has been added to the existing
abstraction layer. We rely on Marcel's fully POSIX-compatible interface to
guarantee that MaGOMP will behave as well as GOMP on pthreads. Then, it
becomes possible to run any existing OpenMP application on top of BubbleSched
by simply relinking it.

Once Marcel threads are created, they basically behave by default as native
pthreads without any notion of team or memory affinity. BubbleSched hooks have
been added in the libgomp code to provide information about thread teams by 
creating bubbles accordingly. 

Therefore, when a thread encounters a nested parallel region and becomes 
the master of a new team, it creates a bubble within its currently holding bubble. 
Then, it moves itself into this new bubble and creates the team's slave threads 
inside it. Finally, the master dispatches the workload across the team. Once their 
work is completed, slave threads die while the master destroys the bubble and 
returns to its original team. As shown in Figure 3, only a few lines of code are
needed to associate a nested team hierarchy with a bubble hierarchy. 



void gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
                      struct gomp_work_share *work_share) {
  struct gomp_team *team;
  team = new_team (nthreads, work_share);

  ... /* Pack 'fn' and 'data' into the 'start_data' structure */

  /* 'thr' designates the calling thread; its declaration is part of the
     code elided above. */
  if (nthreads > 1 && team->prev_ts.team != NULL) {
    /* nested parallelism, insert a marcel bubble */
    marcel_bubble_t *holder = marcel_bubble_holding_task (thr->tid);
    marcel_bubble_init (&team->bubble);
    marcel_bubble_insertbubble (holder, &team->bubble);
    marcel_bubble_inserttask (&team->bubble, thr->tid);
    marcel_attr_setinitbubble (&gomp_thread_attr, &team->bubble);
  }

  /* The master is thread 0 of the team, so only the slaves are created. */
  for (unsigned i = 1; i < nthreads; i++) {
    pthread_create (NULL, &gomp_thread_attr,
                    gomp_thread_start, start_data);
  }
}



Fig. 3. One-to-One correspondence between Marcel's bubble and GOMP's team
hierarchies. 

3.3 A Scheduling Strategy Suited to OpenMP Nested Parallelism 

The challenge of a scheduler for the nested parallelism of OpenMP resides in 
how to distribute the threads over the machine. This must be done in a way 
that favors both a good balancing of the computation and, in the case of multi- 
core and NUMA machines, a good affinity of threads, for better cache effects 
and avoiding the remote memory access penalty. 

For achieving this, we wrote a bubble spread scheduler consisting of a
mere recursive function that uses the API described in Section 3.1 to greedily
distribute the hierarchy of bubbles and threads over the hierarchy of runqueues.
This function takes an array of "current entities" and an array of "current
runqueues". It first sorts the list of current entities according to their
computation load (either explicitly specified by the programmer, or inferred from
the number of threads). It then greedily distributes them onto the current
runqueues by repeatedly assigning the biggest entity to the least loaded
runqueue^2, and recurses separately into the sub-runqueues of each current
runqueue.

^2 This algorithm comes from the greedy algorithm typically used for solving the
bi-partition problem.

It often happens that an entity is much more loaded than others (because it
is a very deep hierarchical bubble for instance). In such a case, a recursive call is
made with this bubble "exploded": the bubble is removed from the "current
entities" and replaced by its content (bubbles and threads). How big a bubble
needs to be before being exploded is a parameter that has to be tuned. This may
depend on the application itself, since it makes it possible to choose between
respecting affinities (by pulling intact bubbles as low as possible) and balancing
the computation load (by exploding bubbles to obtain smaller entities that can be
distributed more evenly).

This way, affinities between threads are taken into account: since they are by 
construction in the same bubble hierarchy, the threads of the same external loop 
iterations are spread together on the same NUMA node or the same multicore 
chip for instance, thus reducing the NUMA penalty and enhancing cache effects. 

Other distribution algorithms are of course possible; we are currently working
on an even more affinity-based algorithm that avoids bubble explosions as much
as possible.

4 Performance Evaluation 

We validated our approach by experimenting with the BT-MZ application. It is
one of the 3D Fluid-Dynamics simulation applications of the Multi-Zone version
of the NAS Parallel Benchmark [dWJ03] 3.2. In this version, the mesh is split
in the x and y directions into zones. Parallelization is then performed twice:
simulation can be performed rather independently on the different zones with
periodic face data exchange (coarse grain outer parallelization), and simulation
itself can be parallelized along the z axis (fine grain inner parallelization).
As opposed to other Multi-Zone NAS Parallel Benchmarks, the BT-MZ case is
interesting because zones have very irregular sizes: the size of the biggest zone
can be as big as 25 times the size of the smallest one. In the original SMP source
code, outer parallelization is achieved by using Unix processes while the inner
parallelization is achieved through an OpenMP static parallel section. Similarly
to Ayguade et al. [AGMJ04], we modified this to use two nested OpenMP static
parallel sections instead, using n_o * n_i threads.

The target machine holds 8 dual-core AMD Opteron 1.8GHz NUMA chips 
(hence a total of 16 cores) and 64GB of memory. The measured NUMA factor 
between chips^3 varies from 1.06 (for neighbor chips) to 1.4 (for most distant
chips). We used the class A problem, composed of 16 zones. We tested both 
the Native POSIX Thread Library of Linux 2.6 (NPTL) and the Marcel library, 
before trying the Marcel library with our bubble spread scheduler. 

We first tried non-nested approaches by only enabling either outer parallelism 
or inner parallelism, as shown in Figure 4.

^3 The NUMA factor is the ratio between remote memory access and local memory
access times.

Fig. 4. Outer parallelism (n_o * 1) and inner parallelism (1 * n_i).

Outer parallelism (n_o * 1): Zones themselves are distributed among the
processors. Due to the irregular sizes of zones and the fact that there are only
a few of them, the computation is not well balanced, and hence the achieved
speedup is limited by the biggest zones.

Inner parallelism (1 * n_i): Simulations in zones are performed sequentially, but
simulations themselves are parallelized along the z axis. The computation
balance is excellent, but the nature of the simulation introduces a lot of
inter-processor data exchange. Particularly because of the NUMA nature of
the machine, the speedup is hence limited to 7.

So as to get the benefits of both approaches (locality and balance), we then
tried the nested approach by enabling both parallelisms. As discussed by Duran
et al. [DGC05], the achieved speedup depends on the relative number of threads
created by the inner and the outer parallelisms, so we tried up to 16 threads for
the outer parallelism (i.e. the maximum since there are 16 zones), and up to 8
threads for the inner parallelism. The results are shown in Figure 5. The nested
speedup achieved by NPTL is very limited (up to 6.28), and is actually worse
than what pure inner parallelism can achieve (almost 7, not represented here
because the "Inner" axis maximum was truncated to 8 for better readability).
Marcel behaves better (probably because user threads are more lightweight),
but it still cannot achieve a better speedup than 8.16. This is due to the fact
that neither NPTL nor Marcel takes affinities of threads into account, leading
to very frequent remote memory accesses, cache invalidation, etc. We hence
used our bubble strategy to distribute the bubble hierarchy corresponding to
the nested OpenMP parallelism over the whole machine, and could then achieve
better results (up to 10.2 speedup with 16*4 threads). This improvement is due
to the fact that the bubble strategy carefully distributes the computation over
the machine (on runqueues) in an affinity-aware way (the bubble hierarchy).

Fig. 5. Nested parallelism.

It must be noted that for achieving the latter result, the only addition we
had to make to the BT-MZ source code is the following line:

    call marcel_set_load(int(proc_zone_size(myid+1)))

that explicitly tells the bubble spread scheduler the load of each zone, so that
they can be properly distributed over the machine. Such a hint (which could
even be dynamic) is very valuable for allowing the runtime environment to
make appropriate decisions, and should probably be added as an extension to
the OpenMP standard. Another way to achieve load balancing would be to create
more or fewer threads according to the zone size [AGMJ04]. This is however a bit
more difficult to implement than the mere function call above.



5 Conclusion 

In this paper, we discussed the importance of establishing a persistent
cooperation between an OpenMP compiler and the underlying runtime system for
achieving high performance on today's multi-core NUMA machines. We showed
how we extended the GNU OpenMP implementation, GOMP, to make use of
the flexible Marcel thread library and its high-level bubble abstraction. This
allowed us to implement a scheduling strategy that is suited to OpenMP nested
parallelism. The preliminary results show that it significantly improves the
achieved speedup.

At this point, we are enhancing our implementation so as to introduce just- 
in-time allocation for Marcel threads, bringing in the notion of "ghost" threads, 
that would only be allocated when first run by a processor. In the short term, 
we will keep validating the obtained results over several other OpenMP
applications, such as Ondes3D (French Atomic Energy Commission). We will compare
the resulting performance with other OpenMP compilers and runtimes. We also 
intend to develop an extension to the OpenMP standard that will provide pro- 
grammers with the ability to specify load information in their applications, which 
the runtime will be able to use to efficiently distribute threads. 

In the longer run, we plan to extract the properties of memory affinity at the
compiler level, and express them by injecting gathered information into more
accurate attributes within the bubble abstraction. These properties may be
obtained either through new directives à la UPC^4 [CDC+99] or be computed
automatically via static analysis [SGDA05]. For instance, this kind of
information is helpful for a bubble-spreading scheduler, as we want to determine
which bubbles to explode or to decide whether or not it is worthwhile to apply a
migrate-on-next-touch mechanism [NLRH06] upon a scheduler decision. All these
extensions will rely on a memory management library that attaches information
to bubbles according to memory affinity, so that, when migrating bubbles, the
runtime system can migrate not only threads but also the corresponding data.

^4 The UPC upc_forall statement adds to the traditional for statement a fourth
field that describes the affinity under which to execute the loop.



6 Software Availability 



Marcel and BubbleSched are available for download within the PM2 distribution
at http://runtime.futurs.inria.fr/Runtime/logiciels.html under
the GPL license. The MaGOMP port of libgomp will be available soon and may 
be obtained on demand in the meantime. 